Senior Project Report: 2024-2025
Senior Project 2025: Analyzing the Growth Rates of Trees in the Ecosystem Preserve
Davis Addink, Sucry Bendeck, Felicia Susanto Department of Computer Science, Calvin University DS 398: Senior Project in Computing Professor Keith VanderLinden May 8, 2025
Problem Description and Research Question
How would variables such as species, tree size (DBH), and location affect tree growth rate, and are there any other variables that would affect GR that are not yet added in our model? We aimed to answer this research question using 50 years of tree measurements from the Calvin Ecosystem Preserve Tree Census, along with data we obtained on local weather and topography.
There are many variables that need to be processed, and we want to use them to analyze and predict the growth rate of trees in the ecosystem preserve. The data is extensive and involves factors recorded over time, requiring both cleaning and modeling to generate meaningful insights.
Background on the Problem
We have just reached the 50th anniversary of the Calvin Ecosystem Preserve and Native Garden’s Tree Census, which includes a plethora of variables such as quadrat number, status, year, stems, species, and tree diameter. The ecosystem preserve includes a patch of land that hasn’t been plowed, preserving its natural state, and data is taken every five years from this area. More importantly, it provides an opportunity to analyze the state and evolution of natural land in West Michigan.
Previous Work
The tree census was first started by Calvin professor Allen Gebben and John Ubels in 1974. More recently, it has been overseen by professor Randall Van Dragt, working with many student volunteers and Ecosystem Preserve Stewards.
Summer research students and professors have continuously collected data on the trees and input them every 5 years. The summer research students before us (summer 2024) were Noelle H and Ella A, and they cleaned up the dataset, did exploratory data analysis, and made visualizations. They also helped organize the structure of GitHub and provided some insights into the initial relationships between variables, which serve as a foundation for further analysis.
Problems Encountered and Solutions
Identification and Integration of Auxiliary Data
In the early stages of modeling tree growth rates, we realized that our dataset lacked several key predictors known to influence growth. After doing further research, we identified multiple important variables that were missing and could help improve the predictive accuracy of our models.
First, we added Growing Degree Days (GDD), a variable that captures both the intensity and duration of the growing season. GDD is calculated using the formula:
(Max Temperature + Min Temperature)/2 + Base Temperature)
For Michigan, the base temperature is 50°F, according to Michigan State University. Including GDD helped us incorporate seasonal temperature variation into our model, which is relevant for tree growth.
We also included precipitation as another climate-related variable, since water availability plays a direct role in plant development.
Next, we integrated LiDAR data obtained from the United States Geological Survey (USGS). This provided us with accurate elevation information for each tree’s location within the ecosystem preserve.
Finally, we calculated the Shannon diversity index to capture local biodiversity around each tree. Using a 30-meter radius, we counted how many trees of different species were within the specified radius and used that to compute a diversity score. This variable allowed us to measure how the diversity of surrounding trees might influence an individual tree’s growth rate.
By adding these variables, we were able to build a more biologically meaningful model that better reflects the environmental and ecological factors influencing tree growth.
Interactive Spatial Graphic
Another challenge we faced was with the interactive spatial graphic we developed to visualize tree growth over time across the Ecosystem Preserve. One major issue was with the coordinate system. We had trouble translating the spatial coordinates in the dataset into accurate longitude and latitude values. As a result, the orientation of our spatial map was initially distorted, which made the data visualization less intuitive and geographically misleading. Additionally, we discovered an important historical detail in the dataset: in 1994, there was a quadrat expansion in the Ecosystem Preserve.
This expansion added a significant number of trees that were outside the original survey area, meaning those trees were not newly sprouted but had simply never been recorded before. Because their first measurement occurred in 1994, many of them appeared to have suddenly grown to a large size in a very short time, especially along the outer edges of the preserve.
To address this and avoid misleading interpretations, we made a design choice in our interactive graphic. We displayed the trees measured in the quadrat expansion using very small dots, placed precisely at their recorded coordinates. By doing this, we hoped to communicate that these trees didn’t grow from seedlings to full size in just one year, but rather that they had already existed and were simply measured for the first time in 1994 with an already large DBH. The alternative to this decision was to have the newly added trees simply appear when their measurements were first taken. However when doing so, we ran into a gg_plot bug, where the trees would appear to visually move. We decided to use the lesser of two “evils” and have the trees start off as very small points as place holders.
Non-Linear Model and ACFs
As we were building our model, we recognized the importance of making sure it was appropriate for our data and reliable enough to draw valid conclusions. This required performing a series of model assessments to evaluate the assumptions behind our approach and identify any structural issues.
We began with a linear model to predict tree growth rates. However, after running diagnostic checks, including a lack-of-linearity test, a response vs. fitted values plot, and an autocorrelation function (ACF) analysis, it became clear that the model wasn’t adequately capturing the data structure. The ACF plot, in particular, showed evidence of autocorrelation on most of the lags, suggesting that the residuals were not independent. This is a problem because linear models assume that residuals are independent and identically distributed, and violating this can bias inference. We suspected this was happening because the true relationship between the predictors and growth rate wasn’t linear.
Tree growth is influenced by complex, nonlinear ecological processes, and forcing a linear model on that structure led to poor performance. As a result, we transitioned to a Generalized Additive Model (GAM) to better capture nonlinearity. Because of the size of our dataset, we used the bam() function from R package mgcv (Wood et al. 2011), which is designed for larger datasets while retaining the flexibility of GAMs. After switching to bam(), we ran another round of model assessments. We tested for linearity, independence of residuals, and checked the ACF plot again. This time, we noticed a spike in autocorrelation at lag 2, indicating that some residual correlation remained. To address this, we revisited our variable selection and identified predictors that were still missing from the model. We made several improvements: We added a smooth function of year to account for long-term trends in tree growth over time.
We also smoothed map, the spatial identifier of each tree, to help model unobserved individual variation.
We introduced an interaction between species and nearest-tree distance to investigate how different species respond to neighborhood and spacing from nearby trees.
We included another interaction between species and DBH (diameter at breast height) to capture species-specific growth patterns based on initial tree size.
Lastly, we added a random effect of map. This allowed us to account for unmeasured, tree-specific variation in environmental conditions that could affect all trees within the same map section, while avoiding over fitting by not treating map as a fixed effect.
Stan Fitting and Adding Measurement Error
Towards the end of this semester, we were hoping to use Bayesian Statistics to include a growth rate measurement error using a stan function, using our learning from Computational Bayesian Statistics, taught by Professor DeRuiter. Unfortunately, due to time and data constraints, we were not able to fulfill this goal. The data we have is not built to have a standardized error. Ideally, each tree would be measured more than once per measurement year, so that human error could be seen year to year. Otherwise, it would be hard to tell if the difference between measurements was a result of human error or actual growth.
Results, Analysis, and Discussion
Results
Note: While we have hidden all R code used to read and prepare data, create visualizations, and fit models, the interested reader can click the “</>Code” button at the top right of the page to see all the source code.
Exploratory Data Analysis
Once we had all of our variables and data organized, we wanted to make initial graphs to explore some potential relationships, as well as visualize the location of where our data is coming from.
Elevation
The elevation of the data area can be seen below, overplayed over Google maps. This was achieved using the LiDar data.
Shannon Diversity Index
Using the Shannon Diversity Index discussed above, we can visualize where in the data area is there more diversity. The lighter colors represent more diversity, and the darker show less diversity. It can be seen in the graph below that the edges of the data area have more diversity than the core. This is likely because of the small stream that runs through the center of the data area, making it harder for some species to grow.
Size vs Competition
Next, we wanted to explore the idea of different species of trees growing larger based on distance of the next nearest tree. Not many inferences can be made, as some of the plots below suffer from over plotting.
Precipitation effect on Growth Rate
Similarly, we wanted to see if the precipitation of each year had an initial noticeable effect on growth rate, but as in the plot above, no real inferences can be made.
GDD vs. DBH
Additionally, we wanted to see if the cumulative GDD had a noticeable effect on the DBH of each tree. Again, no inferences could be made because there are so many points with the same value for GDD. The same can be said for the following graph, precipitation vs DBH.
Our last EDA figure is the interactive spatial plot as discussed above. Each point is plotted using the trees coordinates, the color corresponds to the species of the trees, and the size to the DBH of the tree. The user can scroll the legend on the right to see all species, then double click a specific species to isolate it. To see all the species again they simply need to double click the species again. There is a time slider at the bottom of the graph to show the change of DBH over time. The user can hit the play button or drag the dot to show a specific year.The tiny points around the perimeter are only placeholders for trees that will be measured in the future. When they appear to grow at an alarming rate, that is really because it is the first real measurement for those specific trees.
Model Results
As stated in the “problems encountered” section, we modeled trees’ annula growth rate using a Generalized Additive Model (GAM) via the function bam() from R package mgcv (Wood et al. 2011).
Fixed and Smooth Terms
The model included the following variables:
Cumulative and lagged GDD
Cumulative and lagged Precipitation
Elevation
Shannon diversity index
Distance to the nearest neighboring tree
DBH
Species
Year
Location
We also included two interactions:
Species and DBH
Species and Nearest tree distance
Additionally, the model included a random intercept for tree ID to account for tree-to-tree differences in growth rates.
Significant Predictors Identified via ANOVA
DBH (p < 2e-16): Larger tree experienced higher growth rates, indicating that initial size is a strong predictor of subsequent growth.
Species (p < 2e-16): Species identity significantly influenced growth rate, supporting the hypothesis that different species have inherently different growth dynamics.
PRCP_Lag1 (p < 0.001): Lagged precipitation showed a small but statistically significant negative effect on growth, suggesting that high rain in the previous year can lead to lower growth rate in the current year.
Interaction Effects
Species and DBH: This interaction was significant, indicating that the effect of size on growth varies by species. Some species benefit more from increased size than others.
Species and Nearest tree distance: This interaction was not statically significant, implying that the influence of neighborhood density on growth is similar across species.
Note: Although model assessment is not included here, the interested reader can access assessments through “data modeling attempt.qmd” in the “spring senior project” folder in the repo.
Analysis
The fitted BAM model provides several important insights into the ecological process influencing tree growth. One of the clearest signals is the role of tree size in that larger trees tend to experience greater annual increases in diameter. This aligns with ecological theory, suggesting that larger trees are better at converting resources into growth because of their established root system.
Species identity also came out as a key factor. Different species exhibited different growth patterns, and the significant interaction between species and DBH indicates that the size-growth relationship is not uniform across species.
Climatic influences were also evident. Lagged precipitation had a small but significant negative effect on current year growth. This could be explained by reduced oxygen availability in the soil, or altered microbiological activity following heavy rainfall.
Discussion
These findings offer important ecological insights into the driving factors of tree growth. The strong predictive power of species and initial tree size highlights the important of intrinsic biological traits in shaping individual growth trajectories. The significant prior year precipitation emphasizes the need to consider lagged climate variables in growth models, particularly in the context of increasing climate variability.
The species-specific response to size reflects ecological trade-offs, such as different species investing choosing to invest more in defense, growth, water conservation, or light capture. This means when planning forest management or reforestation, it’s not enough to look at average growth rates alone. We also need to consider how growth changes with tree size and how each species reacts to its environment.
Lastly, while neighborhood crowding is often assumed to influence growth because of competition for light or water, our model found limited significance. This suggests that we need more data, specifically with data on the canopy coverage.
In conclusion, the BAM approach allowed us to integrate multiple predictors and capture the variation of growth in trees from year to year. The ability to model nonlinear trends and spatial clustering proved valuable for revealing hidden patterns that simpler models might overlook. As climate and forest conditions continue to shift, flexible modeling approaches like the one we have produced will be helpful in understanding and managing forest ecosystems.
Possible Future Work
As a team, we thought about several future project ideas that could benefit Calvin and assist future students working on this project. Our first idea is to create a public website where Calvin students, alumni, and anyone interested in the nature preserve can access information about the preserve and ongoing research.
Additionally, the nature preserve will host an event in Fall 2025, in which the visualizations created for this project will be especially useful. These visualizations will help showcase the preserve’s exact location and illustrate how the trees have grown over the past 50 years.
Future teams could consider incorporating data on the variable height of the tree canopy. This would be valuable because a tree’s canopy can influence neighboring trees by affecting the amount of sunlight and precipitation they receive.
Lastly, as previously stated, we explored the idea of incorporating measurement error into the data. We pursued this because we suspect that some DBH measurements may contain inaccuracies that could be affecting the model’s performance and reliability. To address this, future work could refine our modeling approach to additionally account for measurement error.
Note: The interested reader can find this report under “stat 341 report” in the “spring senior project” folder.
Acknowledgements
We would like to thank the faculty, staff, and all the students who have gone before us and collected and organized the data. Additionally, we would like to thank and acknowledge Randy Van Dragt, who is Professor Emeritus and also a preserve senior fellow. Matthew Dykstra (managing director of the Calvin Ecosystem Preserve and Native Gardens). Prof. Stacy DeRuiter, who is our senior advisor and professor for computational Bayesian statistics
Works Cited
Growing degree day information. Integrated Pest Management. (n.d.). https://www.canr.msu.edu/ipm/agriculture/christmas_trees/growing_degree_day_information
Wood, S.N. (2011) Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society (B) 73(1):3-36
USGS LIDAR explorer map. (n.d.). https://apps.nationalmap.gov/lidar-explorer/#/